2  What is Bayesian statistical inference?

2.1 Introduction

2.1.1 Today’s topics

  1. What is Bayesian statistical inference?
  2. Why is it useful in general?
  3. Why is it useful in systems biology?
  4. The big challenge

2.1.2 Computer goals

Set up git/ssh, python, cmdstanpy and cmdstan

2.2 What is Bayesian statistical inference?

2.2.1 Probability function

Analogy: a function that can measure the water in any part of a jug.

i.e. \(p: 2^S \rightarrow [0,1]\) where:

  • \(p(S) = 1\)
  • For disjoint \(A, B \subseteq S\), \(p(A\cup B) = p(A) + p(B)\)
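For a finite sample space the two axioms are easy to check directly. A minimal sketch (the outcomes and numbers are made up):

```python
# A toy probability function on a finite sample space S (hypothetical values).
S = {"heads", "tails"}
p = {"heads": 0.5, "tails": 0.5}


def prob(event):
    """Probability of an event (a subset of S): add up its members."""
    return sum(p[outcome] for outcome in event)


# Axiom 1: p(S) = 1
assert prob(S) == 1.0
# Axiom 2: additivity for disjoint events
assert prob({"heads"}) + prob({"tails"}) == prob({"heads", "tails"})
```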

2.2.2 Statistical Inference

In: facts about a spoonful sample

Out: propositions about a soup population

e.g.

  • spoonful not salty \(\rightarrow\) soup not salty
  • no carrots in spoon \(\rightarrow\) no carrots in soup

2.2.3 Bayesian statistical inference

Statistical inference resulting in a probability.

e.g.

  • spoon \(\rightarrow\) \(p(\text{soup not salty})\) = 99.9%
  • spoon \(\rightarrow\) \(p(\text{no carrots in soup})\) = 95.1%
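The probabilities above come from conditioning on the observed spoonful using Bayes' rule (here \(\theta\) stands for a proposition about the soup and \(y\) for the spoonful data):

\[ p(\theta \mid y) = \frac{p(\theta)\,p(y \mid \theta)}{p(y)} \]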

Non-Bayesian inferences:

  • spoon \(\rightarrow\) best estimate of [salt] is 0.1 mol/l
  • \(p_{null}(\text{spoon})\) = 4.9% \(\rightarrow\) no carrots (p=0.049)

3 Why is Bayesian statistical inference useful in general?

3.0.1 The philosophical reason

Bayesian inference can be interpreted in terms of information and plausible reasoning.

e.g. “According to the model…”

  • “…x is highly plausible.”
  • “…x is more plausible than y.”
  • “…the data doesn’t contain enough information for firm conclusions about x.”

3.0.2 Mathematical reason

Bayesian inference is old!

This means

  • it is well understood mathematically.
  • conceptual surprises are relatively rare.
  • there are many compatible frameworks.

3.0.3 General practical reason

Probabilities decompose nicely:

\[ p(\theta, y) = p(\theta)p(y\mid\hat{y}(\theta)) \]

  • \(p(\theta)\): nice form for background information, e.g. anything non-experimental
  • \(\hat{y}(\theta)\): nice form for structural information, e.g. physical laws
  • \(p(y\mid\hat{y}(\theta))\): nice form for measurement information, e.g. instrument accuracy
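The decomposition above maps directly onto code: the log joint density is just the sum of a log prior and a log likelihood. A minimal sketch in which the prediction function \(\hat{y}\) and all numeric values are invented for illustration:

```python
import math


def norm_logpdf(x, mu, sigma):
    """Log density of the normal distribution N(mu, sigma)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def y_hat(theta):
    """Structural information: a made-up deterministic prediction."""
    return 2.0 * theta


def log_joint(theta, y):
    """log p(theta, y) = log p(theta) + log p(y | y_hat(theta))."""
    log_prior = norm_logpdf(theta, 0.0, 1.0)      # background information
    log_lik = norm_logpdf(y, y_hat(theta), 0.5)   # measurement information
    return log_prior + log_lik
```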

3.1 Why is Bayesian inference useful in systems biology?

3.1.1 Regression models: good for describing measurements

Idea: the measured value depends systematically but noisily on the true value, e.g.

\(y \sim N(\hat{y}, \sigma)\)

Bayesian inference lends itself to regression models that accurately describe the details of the measurement process, e.g.

  • heteroskedasticity \(y \sim N(\hat{y}, \sigma(\hat{y}))\)
  • non-negativity \(y \sim LN(\ln{\hat{y}}, \sigma)\) (also compositionality)
  • unknown bias \(y \sim N(\hat{y} + q, \sigma)\)
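Each of these measurement models is easy to simulate, which is a good way to check that it behaves as intended. A sketch with a hypothetical true value and made-up noise scales:

```python
import math
import random

random.seed(12345)
y_true = 3.0  # hypothetical true value

# heteroskedasticity: the noise scale grows with the true value
y_het = random.gauss(y_true, 0.1 * y_true)

# non-negativity: a lognormal measurement can never fall below zero
y_pos = math.exp(random.gauss(math.log(y_true), 0.2))

# unknown bias: a constant offset q shifts every measurement
q = 0.5
y_biased = random.gauss(y_true + q, 0.1)

assert y_pos > 0  # the lognormal model guarantees positivity
```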

3.1.2 Multi-level models: good for describing sources of variation

Measurement model:

\(y \sim binomial(K, logit^{-1}(ability))\)

GPareto model:

\(ability \sim GPareto(m, k, s)\)

Normal model:

\(ability \sim N(\mu, \tau)\)
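Simulating from a multi-level model makes the two levels concrete: first draw each subject's ability from the population model, then draw their measurements. A sketch using the normal population model, with invented values for \(\mu\), \(\tau\) and \(K\):

```python
import math
import random


def inv_logit(x):
    """Map a real-valued ability to a success probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


random.seed(1)
mu, tau = 0.0, 1.0  # population-level parameters (illustrative values)
K = 20              # trials per subject

# population model: each subject's ability is drawn from N(mu, tau)
abilities = [random.gauss(mu, tau) for _ in range(5)]

# measurement model: successes out of K trials, binomial(K, inv_logit(ability))
successes = [
    sum(random.random() < inv_logit(a) for _ in range(K)) for a in abilities
]

assert all(0 <= s <= K for s in successes)
```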

3.1.3 Generative models: good for representing structural information

Information about hares (\(u\)) and lynxes (\(v\)):

\[\begin{align*} \frac{d}{dt}u &= (\alpha - \beta v)u \\ \frac{d}{dt}v &= (-\gamma + \delta u)v \end{align*}\]

i.e. a deterministic function turning \(\alpha\), \(\beta\), \(\gamma\), \(\delta\), \(u(0)\) and \(v(0)\) into \(u(t)\) and \(v(t)\).
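That deterministic function can be approximated numerically. A minimal sketch using explicit Euler integration (the parameter values are illustrative, not fitted to the real hare/lynx data):

```python
def lotka_volterra(u0, v0, alpha, beta, gamma, delta, dt=0.001, t_end=1.0):
    """Integrate the hare (u) / lynx (v) ODEs with the explicit Euler method."""
    u, v = u0, v0
    t = 0.0
    while t < t_end:
        du = (alpha - beta * v) * u   # hare growth minus predation
        dv = (-gamma + delta * u) * v  # lynx decay plus predation gains
        u, v = u + dt * du, v + dt * dv
        t += dt
    return u, v


# turn parameters and initial conditions into u(t), v(t) at t = 1
u1, v1 = lotka_volterra(
    u0=10.0, v0=5.0, alpha=0.55, beta=0.028, gamma=0.84, delta=0.026
)
```

In a real analysis one would use a proper ODE solver (e.g. Stan's built-in integrators) rather than hand-rolled Euler steps.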

3.2 The big challenge

3.2.1 The big challenge

\(p(\theta \mid y)\) is easy to evaluate (up to a normalising constant) but hard to integrate.

This is bad as we typically want something like

\[ p([salt] < 0.1 \mid spoon=s) \]

which is equivalent to

\[ \int_{0}^{0.1}p([salt] \mid spoon=s)\,d[salt] \]

\(p(\theta \mid y)\) has one dimension per model parameter, so grid-based integration quickly becomes infeasible.

3.2.2 The solution: MCMC

Strategy:

  1. Find a series of numbers that
    • quickly finds the high-probability region in parameter space
    • reliably matches its statistical properties
  2. Do sample-based approximate integration.
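The strategy above can be sketched with a toy random-walk Metropolis sampler. (Stan actually uses Hamiltonian Monte Carlo; the standard-normal target and all tuning values here are invented for illustration.)

```python
import math
import random


def log_post(theta):
    """Unnormalised log posterior: a standard normal as a stand-in example."""
    return -0.5 * theta**2


def metropolis(n_draws, step=1.0, seed=0):
    """Minimal random-walk Metropolis sampler (illustrative, not Stan's HMC)."""
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        # accept with probability min(1, p(proposal) / p(theta))
        if math.log(rng.random()) < log_post(proposal) - log_post(theta):
            theta = proposal
        draws.append(theta)
    return draws


draws = metropolis(20_000)
# sample-based approximate integration: P(theta < 0.1) as a fraction of draws
p_est = sum(d < 0.1 for d in draws) / len(draws)
```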

It (often) works!

We can tell when it doesn’t work!


3.3 Homework

3.3.1 Things to read

Box and Tiao (1992) Ch. 1.1 (available from DTU Findit) gives a nice explanation of statistical inference in general and why Bayes.

Historical interest:

3.3.2 Things to set up

3.3.2.1 Python

First get a recent (ideally 3.10+) version of Python. This can be very annoying, so talk to me if necessary!

Next get used to Python virtual environments.

The method I like is to put the virtual environment in a folder .venv inside the root of my project:

$ python -m venv .venv --prompt=bssb

Then to use it:

$ source .venv/bin/activate
# ... do work
$ deactivate

Tip: use an ergonomic alias to activate venvs, e.g. alias va="source .venv/bin/activate"

3.3.2.2 Git and ssh

git clone git@github.com:teddygroves/bayesian_statistics_for_systems_biologists.git

3.3.2.3 Cmdstanpy and cmdstan

from cmdstanpy import CmdStanModel

# write a minimal Stan program to a file
filename = "example_stan_program.stan"
code = "data {} parameters {real t;} model {t ~ std_normal();}"
with open(filename, "w") as f:
    f.write(code)

# compile the program, then sample from it with MCMC
model = CmdStanModel(stan_file=filename)
mcmc = model.sample()

4 Next time

4.0.0.1 Theory

Hamiltonian Monte Carlo:

  • what?
  • why?

MCMC diagnostics

4.0.0.2 Computer

Stan, cmdstanpy, arviz:

  • formats
  • workflow
  • write a model